# 10. TD Control: Theory and Practice
![Exploration-Exploitation Dilemma](img/exploration-vs.-exploitation.png)
Exploration-Exploitation Dilemma (Source)
## Greedy in the Limit with Infinite Exploration (GLIE)
The Greedy in the Limit with Infinite Exploration (GLIE) conditions were introduced in the previous lesson, when we learned about MC control. There are many ways to satisfy the GLIE conditions, all of which involve gradually decaying the value of \epsilon when constructing \epsilon-greedy policies.
In particular, let \epsilon_i correspond to the i-th time step. Then, to satisfy the GLIE conditions, we need only set \epsilon_i such that:
- \epsilon_i > 0 for all time steps i, and
- \epsilon_i decays to zero in the limit as the time step i approaches infinity (that is, \lim_{i\to\infty} \epsilon_i = 0).
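For example, the schedule \epsilon_i = 1/i satisfies both conditions. Here is a minimal sketch (the function name is just illustrative):

```python
def glie_epsilon(i):
    """One schedule satisfying the GLIE conditions: epsilon_i = 1 / i.

    - epsilon_i > 0 for every time step i >= 1
    - epsilon_i -> 0 as i -> infinity
    """
    return 1.0 / i

# epsilon stays positive but shrinks toward zero
print([glie_epsilon(i) for i in (1, 10, 100, 1000)])  # [1.0, 0.1, 0.01, 0.001]
```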
## In Theory
All of the TD control algorithms we have examined (Sarsa, Sarsamax, Expected Sarsa) are guaranteed to converge to the optimal action-value function q_*, as long as the step-size parameter \alpha is sufficiently small, and the GLIE conditions are met.
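As a reminder of where the step-size parameter \alpha enters each algorithm, here is a rough sketch of the three update rules, assuming a NumPy Q-table indexed as Q[state, action] (the function and variable names are illustrative, and terminal-state handling is omitted for brevity):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # Sarsa: bootstrap from the action actually taken in the next state
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def sarsamax_update(Q, s, a, r, s_next, alpha, gamma):
    # Sarsamax (Q-learning): bootstrap from the greedy action in the next state
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, eps):
    # Expected Sarsa: bootstrap from the expected action value in the next
    # state under the current epsilon-greedy policy
    nA = Q.shape[1]
    probs = np.full(nA, eps / nA)
    probs[np.argmax(Q[s_next])] += 1.0 - eps
    target = r + gamma * np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```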
Once we have a good estimate for q_*, a corresponding optimal policy \pi_* can then be quickly obtained by setting \pi_*(s) = \arg\max_{a\in\mathcal{A}(s)} q_*(s, a) for all s\in\mathcal{S}.
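With a Q-table stored as a NumPy array of shape [num_states, num_actions] (an assumption about the representation), this policy extraction is a single argmax over the action axis:

```python
import numpy as np

def greedy_policy(Q):
    # pi_*(s) = argmax_a q_*(s, a), computed for every state at once
    return np.argmax(Q, axis=1)
```

Note that NumPy's argmax breaks ties in favor of the lowest-indexed action; any maximizing action is equally valid.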
## In Practice
In practice, it is common to completely ignore the GLIE conditions and still recover an optimal policy. (You will see an example of this in the solution notebook.)
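For instance, many implementations simply keep \epsilon fixed at a small constant throughout training. A minimal sketch of such an \epsilon-greedy action selection follows (the value 0.1 is an arbitrary illustrative choice, not a recommendation from this lesson):

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1, rng=None):
    # With probability epsilon explore (uniform random action);
    # otherwise exploit (greedy action from the Q-table).
    if rng is None:
        rng = np.random.default_rng()
    nA = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(nA))
    return int(np.argmax(Q[state]))
```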
## Optimism
You have learned that for any TD control method, you must begin by initializing the values in the Q-table. It has been shown that initializing the estimates to large values can improve performance. For instance, if all of the possible rewards that can be received by the agent are negative, then initializing every estimate in the Q-table to zeros is a good technique. In this case, we refer to the initialized Q-table as optimistic, since the action-value estimates are guaranteed to be larger than the true action values.
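A minimal sketch of this kind of initialization, assuming a tabular environment whose sizes below are just placeholders:

```python
import numpy as np

num_states, num_actions = 48, 4  # placeholder sizes for a small tabular task

# If every possible reward is negative, the true action values are all
# negative, so a Q-table of zeros is already optimistic.
Q = np.zeros((num_states, num_actions))

# More generally, initializing above any achievable return encourages the
# agent to try every action in every state at least once.
Q_optimistic = np.full((num_states, num_actions), 10.0)
```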